Genome Biology
Springer Science and Business Media LLC
All preprints, ranked by how well they match Genome Biology's content profile, based on 555 papers previously published here. The average preprint has a 0.30% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Hüther, P.; Hagmann, J.; Nunn, A.; Kakoulidou, I.; Pisupati, R.; Langenberger, D.; Weigel, D.; Johannes, F.; Schultheiss, S. J.; Becker, C.
Whole-genome bisulfite sequencing (WGBS) is the standard method for profiling DNA methylation at single-nucleotide resolution. Many WGBS-based studies aim to identify biologically relevant loci that display differential methylation between genotypes, treatment groups, tissues, or developmental stages. Over the years, different tools have been developed to extract differentially methylated regions (DMRs) from whole-genome data. Often, such tools are built upon assumptions from mammalian data and do not consider the substantially more complex and variable nature of plant DNA methylation. Here, we present MethylScore, a pipeline to analyze WGBS data and to account for plant-specific DNA methylation properties. MethylScore processes data from genomic alignments to DMR output and is designed to be usable by novice and expert users alike. It uses an unsupervised machine learning approach to segment the genome by classification into states of high and low methylation, substantially reducing the number of necessary statistical tests while increasing the signal-to-noise ratio and the statistical power. We show how MethylScore can identify DMRs from hundreds of samples and how its data-driven approach can stratify associated samples without prior information. We identify DMRs in the A. thaliana 1001 Genomes dataset to unveil known and unknown genotype-epigenotype associations. MethylScore is an accessible pipeline for plant WGBS data, with unprecedented features for DMR calling in small- and large-scale datasets; it is built as a Nextflow pipeline and its source code is available at https://github.com/Computomics/MethylScore.
Demetci, P.; Tran, Q. H.; Redko, I.; Singh, R.
The availability of various single-cell sequencing technologies allows one to jointly study multiple genomic features and understand how they interact to regulate cells. Although there are experimental challenges to simultaneously profile multiple features on the same single cell, recent computational methods can align the cells from unpaired multi-omic datasets. However, studying regulation also requires us to map the genomic features across different measurements. Unfortunately, most single-cell multi-omic alignment tools cannot perform these alignments or need prior knowledge. We introduce SCOOTR, a co-optimal transport-based method, which jointly aligns both cells and genomic features of unpaired single-cell multi-omic datasets. We apply SCOOTR to various single-cell multi-omic datasets with different types of measurements. Our results show that SCOOTR provides quality alignments for unsupervised cell-level and feature-level integration of datasets with sparse feature correspondences (e.g., one-to-one mappings). For datasets with dense feature correspondences (e.g., many-to-many mappings), our joint framework allows us to provide supervision on one level (e.g., cell types), thus improving alignment performance on the other (e.g., genomic features) or vice-versa. The unique joint alignment framework makes SCOOTR a helpful hypothesis-generation tool for the integrative study of unpaired single-cell multi-omic datasets. Available at: https://github.com/rsinghlab/SCOOTR.
Rahmani, E.; Jew, B.; Schweiger, R.; Rhead, B.; Criswell, L. A.; Barcellos, L. F.; Eskin, E.; Rosset, S.; Sankararaman, S.; Halperin, E.
We benchmarked two approaches for the detection of cell-type-specific differential DNA methylation: Tensor Composition Analysis (TCA) and a regression model with interaction terms (CellDMC). Our experiments, alongside rigorous mathematical explanations, show that TCA is superior to CellDMC, thus resolving recent criticisms raised by Jing et al. In response to misconceptions by Jing and colleagues about modelling cell-type specificity and the application of TCA, we further discuss best practices for performing association studies at cell-type resolution. The scripts for reproducing all of our results and figures are publicly available at github.com/cozygene/CellTypeSpecificMethylationAnalysis.
Balaj, L.; Lee, H.; Gashi, D.; Batool, S. M.; Escobedo, A. K.; Carter, B. S.
m6ASeqTools is an R package designed to streamline the post-processing and interpretation of site-level m6A predictions from m6Anet. It provides descriptive summaries of m6A distribution across transcripts, genes, biotypes, and transcript regions, and enables condition comparisons through a calculated weighted modification ratio. By integrating differential gene expression data, the package links methylation changes with expression differences, providing biotype-specific and region-specific insights into how m6A localization patterns relate to transcriptional regulation. Availability and implementation: m6ASeqTools is freely available at https://github.com/hannalee809/m6ASeqTools. Supplementary information: Supplementary data are available at Bioinformatics online.
Song, Y.; Papatheodorou, I.; Brazma, A.
The cross-species comparison of expression profiles uncovers functional similarities and differences between cell types and helps refine their evolutionary relationships. Current analysis strategies typically follow the ortholog conjecture, which posits that the expression of orthologous genes is most similar between species. However, the extent to which this holds true at different evolutionary distances is unknown. Here, we systematically explore the ortholog conjecture in comparative scRNA-seq data. We devise a robust analytical framework, GeneSpectra, to classify genes by expression specificity and distribution across cell types. Our analysis reveals that genes expressed ubiquitously across nearly all cell types exhibit strong conservation of this pattern across species, as do genes with high expression specificity. In contrast, genes with intermediate specificity fluctuate between classes. As expected, ortholog expression becomes more divergent with increased species distance. We also find an overall correlation between similarity in expression profiles and sequence conservation. Finally, our results allow the identification of gene classes with the highest probability of expression pattern conservation, which are the most useful for cell type alignment between species. By calibrating reliance on the ortholog conjecture for individual genes, we thus provide a comprehensive framework for the comparative analysis of single-cell data.
Gynter, A.; Meistermann, D.; Lahdesmaki, H.; Kilpinen, H.
Bulk RNA-Seq remains a widely adopted technique to profile gene expression, primarily due to the persistent challenges associated with achieving single-cell resolution. However, a key challenge is accurately estimating the proportions of different cell types within these bulk samples. To address this issue, we introduce DeconV, a probabilistic framework for cell-type deconvolution that uses scRNA-Seq data as a reference. This approach aims to mitigate some of the limitations in existing methods by incorporating statistical frameworks developed for scRNA-Seq, thereby simplifying issues related to reference preprocessing such as normalization and marker gene selection. We benchmarked DeconV against established methods, including MuSiC, CIBERSORTx, and Scaden. Our results show that DeconV performs comparably in terms of accuracy to the best-performing method, Scaden, but provides additional interpretability by offering confidence intervals for its predictions. Furthermore, the modular design of DeconV allows for the investigation of discrepancies between bulk-sequenced samples and artificially generated pseudo-bulk samples.
Ioannou, A.; Friman, E. T.; Daub, C. O.; Bickmore, W. A.; Biddie, S. C.
Plasma cell-free RNA (cfRNA) reflects tissue- and cell-type-specific activity across pathological states and is a promising biomarker for organ injury and disease. Computational deconvolution methods are widely used to infer organ and cell-type contributions to cfRNA profiles. However, most were originally developed for single-tissue bulk transcriptomes and their performance in body-wide cfRNA settings, where any tissue or cell type can contribute, remains poorly characterised. Here, we present a systematic benchmarking of tissue- and cell type-of-origin deconvolution for plasma cfRNA that considers both methodological and reference-related sources of variability under realistic cfRNA simulation settings. We evaluated seven commonly used deconvolution methods across distinct algorithmic classes and multi-organ reference configurations derived from bulk and single-cell atlases. We assessed performance using simulation frameworks that model multi-organ mixtures, technical noise, and transcript degradation. We further examined deconvolution methods across multiple previously published clinical cfRNA cohorts spanning diverse disease contexts. Across both tissue- and cell-type-level analyses, deconvolution performance was strongly influenced by both method choice and reference parameters. Tissue-of-origin inference was comparatively robust across simulated and clinical datasets, recovering disease-associated organ signals and concordance with biochemical markers. In contrast, cell type-of-origin inference showed greater variability and reduced consistency across analytical settings, leading to divergent interpretations in both simulations and published clinical cfRNA cohorts. Together, these findings demonstrate that methodological and reference-related variability are major sources of uncertainty in cfRNA deconvolution, with tissue-level inference being more robust than cell-type-level inference. 
Our benchmarking framework provides guidance for reference selection and comparative interpretation in cfRNA deconvolution.
Aylward, A. J.; Petrus, S.; Mamerto, A.; Hartwick, N. T.; Michael, T. P.
Summary: Pangenomes are replacing single reference genomes as the definitive representation of DNA sequence within a species or clade. Pangenome analysis predominantly leverages graph-based methods that require computationally intensive multiple genome alignments, do not scale to highly complex eukaryotic genomes, limit their scope to identifying structural variants (SVs), or incur bias by relying on a reference genome. Here, we present PanKmer, a toolkit designed for reference-free analysis of pangenome datasets consisting of dozens to thousands of individual genomes. PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes SNPs, INDELs, and SVs. It also includes functions for downstream analysis of the k-mer index, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. For example, k-mers can be "anchored" in any individual genome to quantify sequence variability or conservation at a specific locus. This facilitates workflows with various biological applications, e.g. identifying cases of hybridization between plant species. PanKmer provides researchers with a valuable and convenient means to explore the full scope of genetic variation in a population, without reference bias. Availability and implementation: PanKmer is implemented as a Python package with components written in Rust, released under a BSD license. The source code is available from the Python Package Index (PyPI) at https://pypi.org/project/pankmer/ as well as GitLab at https://gitlab.com/salk-tm/pankmer. Full documentation is available at https://salk-tm.gitlab.io/pankmer/. Supplementary information: Supplementary data are available online.
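To make the idea of a k-mer presence-absence table concrete, here is a minimal pure-Python sketch. It is not PanKmer's API or index format: the function names (`kmers`, `presence_absence`, `jaccard`) are hypothetical, the toy genomes are invented, and the sketch ignores reverse complements and any compression a real index would use.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def presence_absence(genomes, k):
    """Map each observed k-mer to a tuple of 0/1 flags, one per genome.

    `genomes` is a dict of name -> sequence; columns follow sorted names.
    """
    names = sorted(genomes)
    table = defaultdict(lambda: [0] * len(names))
    for col, name in enumerate(names):
        for kmer in kmers(genomes[name], k):
            table[kmer][col] = 1
    return names, {kmer: tuple(flags) for kmer, flags in table.items()}


def jaccard(table, i, j):
    """Jaccard similarity between the k-mer sets of genome columns i and j."""
    shared = sum(1 for flags in table.values() if flags[i] and flags[j])
    union = sum(1 for flags in table.values() if flags[i] or flags[j])
    return shared / union if union else 0.0


# Toy input (hypothetical sequences, for illustration only).
genomes = {"g1": "ACGTACGT", "g2": "ACGTTCGT"}
names, table = presence_absence(genomes, k=4)
similarity = jaccard(table, 0, 1)
```

Restricting the rows of such a table to the k-mers found in one window of one genome would give a local, "anchored" version of the same similarity statistic.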
Groth, T. E.; Mishin, A. A.; Rao, V.; Tibet, R.; Troll, C. J.
Cell-free DNA methylation sequencing provides insight into tissue of origin and chromatin structure. In some workflows, generating libraries includes end-repair. Using matched single-stranded and double-stranded libraries prepared from the same cfDNA extracts, we show that end-repair in double-stranded DNA libraries reduces globally inferred CpG methylation, leading to decreased tissue-of-origin accuracy. Trimming read termini partially mitigates this bias but decreases coverage and removes fragmentomic information compared to single-stranded DNA libraries, which forgo end-repair.
Schikora-Tamarit, M. A.; Gabaldon, T.
Structural variants (SVs) like translocations, deletions, and other rearrangements underlie genetic and phenotypic variation. SVs are often overlooked because they are difficult to detect from short-read sequencing. Most algorithms yield low recall on humans, but the performance in other organisms is unclear. Similarly, despite remarkable differences across species genomes, most approaches use parameters optimized for humans. To overcome this and enable species-tailored approaches, we developed perSVade (personalized Structural Variation Detection), a pipeline that identifies SVs in a way that is optimized for any input sample. Starting from short reads, perSVade uses simulations on the reference genome to choose the best SV calling parameters. The output includes the optimally called SVs and the accuracy, which is useful for assessing confidence in the results. In addition, perSVade can call small variants and copy-number variations. In summary, perSVade automatically identifies several types of genomic variation from short reads using sample-optimized parameters. We validated that perSVade increases the SV calling accuracy on simulated variants for six diverse eukaryotes, and on datasets of validated human variants. Importantly, we found no universal set of "optimal" parameters, which underscores the need for species-specific parameter optimization. PerSVade will improve our understanding of the role of SVs in non-human organisms.
Setter, D.; Lohse, K.; Baird, S. J. E.
Most ancestry-assignment methods rely on putatively pure reference panels, which are often unrealistic and bias inference. The genome polarisation algorithm diem, introduced previously, avoids reference panels by jointly inferring the polarity of common allelic states and quantifying variant diagnosticity via an expectation-maximisation procedure. Here we present diempy, an efficient Python implementation of diem coupled with tools that turn polarised calls into analysis-ready outputs. diempy offers lossless VCF-to-diem BED conversion; ploidy-aware handling of individuals and chromosomes; flexible masking of sites, regions and individuals; and interactive visualisation of polarised genomes, hybrid indices, clines and ternary plots. Post-processing functions include DI thresholding, kernel smoothing, and automatic detection and run-length encoding of contiguous ancestry tracts. BED-based I/O facilitates integration with population-genomic workflows (e.g. filtering by annotation or ploidy). These features make reference-free genome polarisation with diempy practical and reproducible for studies of population structure, admixture and species barriers.
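The run-length encoding of contiguous ancestry tracts mentioned above can be illustrated with a short sketch. This is not diempy's implementation or API: the function name `ancestry_tracts`, the tuple layout, and the 0/1/2 state coding are assumptions made for the example.

```python
from itertools import groupby


def ancestry_tracts(positions, states):
    """Run-length encode per-site states into contiguous tracts.

    Returns a list of (start_pos, end_pos, state, n_sites) tuples, one per
    maximal run of identical consecutive states. `positions` must be sorted
    and the same length as `states`.
    """
    if len(positions) != len(states):
        raise ValueError("positions and states must have the same length")
    tracts = []
    idx = 0
    for state, run in groupby(states):
        n_sites = len(list(run))
        tracts.append((positions[idx], positions[idx + n_sites - 1], state, n_sites))
        idx += n_sites
    return tracts


# Hypothetical polarised calls along one chromosome of one individual:
# 0 and 2 stand for the two homozygous states, 1 for heterozygous.
positions = [100, 250, 400, 900, 1200, 1500]
states = [0, 0, 2, 2, 2, 1]
tracts = ancestry_tracts(positions, states)
```

Each tuple maps directly onto a BED-like row (start, end, label), which is one reason a tract representation fits naturally with BED-based I/O.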
Liu, J.; Gibcus, J. H.; Dekker, J.
Chromatin loop calling from Hi-C data often exhibits substantial variability across related samples, limiting reproducibility and complicating comparative biological analyses. Conventional loop callers such as HiCCUPS are optimized for single-sample loop detection and are not designed for consistent comparison of loop positions across multiple datasets, e.g., across conditions or time points. Here, we present UnionLoops, a computational workflow for reproducible chromatin loop calling across multiple related samples. UnionLoops integrates information across datasets to determine positions and dataset-specificity of looping interactions. It constructs a unified candidate loop set, applies consistent filtering and aggregation, and evaluates loop support across samples to distinguish shared looping interactions from dataset-specific loop calls. Using time-course Hi-C datasets, we demonstrate that UnionLoops increases sensitivity for detecting shared chromatin loops, reduces spurious sample-specific calls, and improves concordance with independent genomic features, including CTCF and cohesin occupancy. These improvements support more reliable downstream analyses and enable improved biological interpretation of chromatin loop organization and dynamics across related experimental conditions.
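As a rough illustration of building a unified candidate set and scoring support across samples, the sketch below bins loop anchors and counts how many samples call each binned loop. It is a simplification under stated assumptions, not the UnionLoops algorithm: the function names, the 10 kb bin size, and the rule "shared means called in every sample" are all hypothetical, and a real workflow would likely evaluate support from the contact data rather than from call overlap alone.

```python
from collections import Counter


def bin_loop(loop, resolution):
    """Snap a (chrom, anchor1, anchor2) loop to bin indices at the given resolution."""
    chrom, a, b = loop
    return (chrom, a // resolution, b // resolution)


def loop_support(samples, resolution=10_000):
    """Count, for each binned loop in the union, how many samples call it."""
    support = Counter()
    for loops in samples.values():
        # A set ensures each sample contributes at most once per binned loop.
        support.update({bin_loop(loop, resolution) for loop in loops})
    return support


def classify(support, n_samples):
    """Split the union into loops called in all samples and in exactly one."""
    shared = sorted(k for k, n in support.items() if n == n_samples)
    specific = sorted(k for k, n in support.items() if n == 1)
    return shared, specific


# Toy per-sample loop calls (hypothetical coordinates).
samples = {
    "t0": [("chr1", 100_000, 500_000), ("chr1", 1_200_000, 1_600_000)],
    "t1": [("chr1", 104_000, 503_000)],
}
support = loop_support(samples)
shared, specific = classify(support, n_samples=len(samples))
```

In this toy case the two slightly offset calls at roughly 100 kb and 500 kb fall into the same bin pair and are treated as one shared loop, while the second loop in `t0` is reported as sample-specific.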
Meyer, E.; Saldivar, E.; Kokot, M.; Xue, B.; Deorowicz, S.; Rhee, S. Y.; Salzman, J.
Most plant genomes and their regulation remain unknown. We used SPLASH, a new reference-genome-free algorithm for detecting sequence variation, to analyze transcriptional and post-transcriptional regulation from RNA-seq data. We discovered differential homolog expression during maize pollen development, and imbibition-dependent cryptic splicing in Arabidopsis seeds. SPLASH enables discovery of novel regulatory mechanisms, including differential regulation of genes from hybrid parental haplotypes, without the use of alignment to a reference genome.
Wilcox, J. J. S.; Foucault, Q. J.; Gossmann, T. I.
Tissues represent a fundamental evolutionary interface at the junction of genotype and phenotype. Indeed, gene regulation often occurs at the tissue level and manifests itself through tissue-specific epigenetic modifications. Studies investigating tissue epigenetics are limited by access to pure tissues. Tissues not only differ epigenetically, they are also subject to genetic differentiation through somatic mutations. As somatic mutations follow predictable patterns of inheritance, the application of population genomic approaches to inter- and intra-tissue variation could allow for the efficient detection of epigenetic modifications, even when tissue samples are convoluted. Here, we present an approach that uses de-novo somatic mutations to deconvolute 5mC methylation patterns through analysis of shifts in tissue-specific allele frequencies. We use simulations and bisulfite sequencing data to show that somatic mutations are common and detectable in next-generation sequencing data. We then use changes in mutation frequencies to accurately derive the proportional tissue of origin along a gradient of in silico subsamples of mixed-tissue bisulfite reads. We confirm that mixed tissues bias estimates of methylation levels and prevent detection of methylation differences at high levels of mixture. Our derived estimates of tissue contamination allow for unbiased and accurate deconvolution of mixed-tissue methylations in CpG and non-CpG context. We are ultimately able to recover 15-30% of differentially-methylated sites, and approximately 40-90% of differentially-methylated CpG islands and gene bodies in any cytosine context at contamination levels up to 90%. Our findings highlight the utility of population genomic approaches across scales, and expand the accessibility of epigenetics studies within evolutionary biology.
Wells, S. B.; Shahnawaz, H.; Jones, J. L.
dreampy is a Python implementation of the R dreamlet framework for pseudobulk differential expression analysis of single-cell RNA-seq data. dreamlet combines voom precision-weighted linear mixed models with empirical Bayes moderation to handle batch effects, repeated measures, and other hierarchical structure in multi-donor studies, but exists entirely within the R/Bioconductor ecosystem. dreampy reproduces this pipeline natively in Python, integrating with AnnData and the scverse ecosystem.
Soltys, V.; Peters, M.; Su, D.; Kucka, M.; Chan, Y. F.
Gene regulation underpins development and is an intricate biological process involving transcription, typically at promoters within accessible chromatin. To understand cell-type-specific regulatory networks, the ability to capture both transcription and chromatin accessibility simultaneously is crucial. However, joint measurements are technically challenging and current methodologies still face adoption challenges. Here, we present easySHARE-seq, an improvement on SHARE-seq, for the simultaneous measurement of ATAC- and RNA-seq in single cells. We address several limitations of the previous method by improving the barcode and streamlining the protocol. As a result, easySHARE-seq libraries have a usable sequence of up to 300 bp (a 200 bp increase), making them suitable for, e.g., investigation of allele-specific signals or variant discovery. Furthermore, easySHARE-seq libraries do not require a dedicated sequencing run, thus saving costs. We applied easySHARE-seq to murine liver nuclei and recovered 19,664 nuclei with joint chromatin and expression profiles. By benchmarking against other combinatorial indexing-based techniques, we showed that we can recover over 1.5-fold more transcripts per cell while retaining high scalability and low cost. To showcase our method, we identified cell types, exploited the multiomic measurements to link cis-regulatory elements to their target genes and investigated liver-specific micro-scale changes. We conclude that easySHARE-seq improves upon previous methods and can produce high-quality multiomic datasets. We expect it to be applicable to a wide range of study designs.
Hu, Z.; Przytycki, P. F.; Pollard, K. S.
CellWalker2 is a graph diffusion-based method for single-cell genomics data integration. It extends the CellWalker model by incorporating hierarchical relationships between cell types, providing estimates of statistical significance, and adding data structures for analyzing multi-omics data so that gene expression and open chromatin can be jointly modeled. Our open-source software enables users to annotate cells using existing ontologies and to probabilistically match cell types between two or more contexts, including across species. CellWalker2 can also map genomic regions to cell ontologies, enabling precise annotation of elements derived from bulk data, such as enhancers, genetic variants, and sequence motifs. Through simulation studies, we show that CellWalker2 performs better than existing methods in cell type annotation and mapping. We then use data from the brain and immune system to demonstrate CellWalker2's ability to discover cell type-specific regulatory programs and both conserved and divergent cell type relationships in complex tissues.
Trimbour, R.; Saez-Rodriguez, J.; Cantini, L.
Chromatin 3D folding creates numerous DNA interactions, participating in gene expression regulation. Single-cell chromatin-accessibility assays now profile hundreds of thousands of cells, challenging existing methods for mapping cis-regulatory interactions. We present CIRCE, a fast and scalable Python package to predict cis-regulatory DNA interactions from single-cell chromatin accessibility data. CIRCE re-implements the Cicero workflow to analyse single-cell atlases, cutting runtime and memory use by several orders of magnitude. We also provide new options to compute metacells, grouping similar cells to reduce data sparsity. We benchmarked CIRCE against Cicero on two datasets of different sizes and demonstrated the improvement from CIRCE's metacell strategy with promoter capture Hi-C data. We also evaluated how DNA interaction predictions are impacted by different pre-processing. We observed a negative impact of Cicero's count normalization, and the best performance was obtained with the single-cell count matrix directly. Finally, we demonstrated the scalability of CIRCE by processing a dataset of more than 700,000 cells and 1 million DNA regions in less than an hour. CIRCE should greatly facilitate the prediction of DNA region interactions for scverse and Python users, while providing new and up-to-date pre-processing insights. Availability and reproducibility: CIRCE is released as open-source software under the AGPL-3.0 license. The package source code is available on GitHub at https://github.com/cantinilab/CIRCE, and its documentation is accessible at https://circe.readthedocs.io. The code to reproduce the presented results is available as a Snakemake pipeline at https://github.com/cantinilab/circe_reproducibility.
Yu, M.; Abnousi, A.; Zhang, Y.; Li, G.; Lee, L.; Chen, Z.; Fang, R.; Wen, J.; Sun, Q.; Li, Y.; Ren, B.; Hu, M.
Single cell Hi-C (scHi-C) analysis has been increasingly used to map the chromatin architecture in diverse tissue contexts, but computational tools to define chromatin contacts at high resolution from scHi-C data are still lacking. Here, we describe SnapHiC, a method that can identify chromatin loops at high resolution and accuracy from scHi-C data. We benchmark SnapHiC against HiCCUPS, a common tool for mapping chromatin contacts in bulk Hi-C data, using scHi-C data from 742 mouse embryonic stem cells. We further demonstrate its utility by analyzing single-nucleus methyl-3C-seq data from 2,869 human prefrontal cortical cells. We uncover cell-type-specific chromatin loops and predict putative target genes for non-coding sequence variants associated with neuropsychiatric disorders. Our results suggest that SnapHiC could facilitate the analysis of cell-type-specific chromatin architecture and gene regulatory programs in complex tissues.
Arzalluz-Luque, A.; Salguero, P.; Tarazona, S.; Conesa, A.
Alternative splicing (AS) is a highly regulated post-transcriptional mechanism known to modulate isoform expression within genes and contribute to cell-type identity. However, the extent to which alternative isoforms establish co-expression networks that may be relevant to cellular function has not yet been explored. Here, we present acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships. To achieve this, we developed and validated percentile correlations, a novel approach that overcomes data sparsity and yields accurate co-expression estimates from single-cell data. Next, acorde uses correlations to cluster co-expressed isoforms into a network, unraveling cell type-specific alternative isoform usage patterns. By selecting same-gene isoforms between these clusters, we subsequently detect and characterize genes with co-differential isoform usage (coDIU) across neural cell types. Finally, we predict functional elements from long read-defined isoforms and provide insight into biological processes, motifs and domains potentially controlled by the coordination of post-transcriptional regulation.